Machine learning models are not always evaluated with statistical rigor. This can lead to inferential flaws when assumptions are made about the underlying data and the performance data, especially when cross-validation is used, since scores from folds that share training data are not independent. In this paper, a Bayesian method of model evaluation is compared to a non-parametric frequentist method. In addition, a metric for analyzing the fairness of a particular algorithm is tested.
The evaluation techniques were applied to a dataset of student and course data made available by the Open University. A system was built to train and test predictive models of student success. The aim was to predict students at risk of failing or withdrawing from a course using the first 30 days of data extracted from the virtual learning environment. In an applied setting, these predictions could be used to direct additional resources to at-risk students.
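As an illustration of restricting the input to the first 30 days, the sketch below sums virtual learning environment clicks per student inside a day window. The rows are hypothetical and only mimic the shape of the dataset's `studentVle` table, where `date` is the day relative to the course start. The helper name `clicks_in_window` and the half-open window starting at day 0 are assumptions, not the project's actual extraction logic.

```python
from collections import defaultdict

# Hypothetical rows: (id_student, date, sum_click).
# `date` is the day relative to the course start and can be negative.
vle_rows = [
    (101, -5, 3),   # activity before the official start (excluded here by assumption)
    (101, 0, 10),
    (101, 12, 4),
    (101, 45, 7),   # outside the 30-day window
    (202, 29, 2),
    (202, 30, 9),   # day 30 is excluded by the half-open window assumed here
]


def clicks_in_window(rows, start_day=0, end_day=30):
    """Sum clicks per student for start_day <= date < end_day."""
    totals = defaultdict(int)
    for student, day, clicks in rows:
        if start_day <= day < end_day:
            totals[student] += clicks
    return dict(totals)


first30 = clicks_in_window(vle_rows)
print(first30)  # {101: 14, 202: 2}
```

Whether pre-start activity should count toward the window is a modelling choice; passing a negative `start_day` would include it.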
The project included creating a database to cleanse, transform, and analyze the dataset. Predictive input features were engineered through a combination of exploratory analysis and inspiration from research. Four different subsets of input features were applied to nine different classification algorithms. Both randomized and exhaustive hyperparameter tuning procedures were tried, producing hundreds of distinct hyperparameter settings.
The Bayesian strategy provided more conclusive results by determining a “region of practical equivalence” as opposed to an inability to reject the null hypothesis. The results were similar to findings from research, which typically had tree-based ensemble methods in the upper-equivalence region.
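To make the idea of a region of practical equivalence (ROPE) concrete, here is a minimal sketch of a simplified Bayesian sign test written with the standard library only. It is not necessarily the procedure used in this project: the ROPE half-width of 0.01, the uniform Dirichlet prior, and the score differences are all illustrative assumptions.

```python
import random


def bayesian_sign_test(diffs, rope=0.01, prior=1.0, n_samples=20000, seed=0):
    """Return (P(A better), P(practically equivalent), P(B better)).

    diffs: paired score differences, model A minus model B.
    Each probability is the share of posterior draws in which that
    region carries the largest weight.
    """
    rng = random.Random(seed)
    left = sum(d < -rope for d in diffs)    # B better by more than the ROPE
    right = sum(d > rope for d in diffs)    # A better by more than the ROPE
    inside = len(diffs) - left - right      # difference too small to matter
    # Dirichlet posterior; the same pseudo-count on each region is an assumption.
    alphas = [left + prior, inside + prior, right + prior]
    wins = [0, 0, 0]
    for _ in range(n_samples):
        # Gamma draws, once normalized, form a Dirichlet sample;
        # normalizing is unnecessary because only the largest one matters.
        draws = [rng.gammavariate(a, 1.0) for a in alphas]
        wins[draws.index(max(draws))] += 1
    p_b, p_rope, p_a = (w / n_samples for w in wins)
    return p_a, p_rope, p_b


# Hypothetical differences in ROC AUC between two models.
diffs = [0.003, -0.002, 0.004, 0.001, -0.001, 0.006, 0.002, 0.0, 0.005, -0.003]
p_a, p_rope, p_b = bayesian_sign_test(diffs)
print(p_a, p_rope, p_b)
```

A frequentist test on the same differences could only fail to reject the null hypothesis, whereas the ROPE probability states directly how plausible practical equivalence is.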
The proposed metric for predictive fairness is called the Absolute Between Receiver Operating Characteristic Area (ABROCA). This metric was first introduced at the 2019 International Learning Analytics & Knowledge Conference. A significant relationship was found between ABROCA and the gender ratio of a course, as well as between ABROCA and the ratio of students in a course identifying as having a disability. No significant relationship was found between ABROCA and overall model performance.
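ABROCA is the area between the ROC curves of two groups, taken in absolute value: the integral over the false positive rate from 0 to 1 of |ROC_a(t) - ROC_b(t)|. The following is a rough sketch that approximates this integral with a midpoint rule. The function names, the grid size, and the toy labels and scores are assumptions for illustration, not the project's implementation.

```python
def roc_points(labels, scores):
    """Return ROC points (fpr, tpr), sweeping the threshold from high to low.

    Assumes binary labels (0/1) with at least one of each class.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def tpr_at(points, fpr):
    """Linearly interpolate the TPR of a ROC curve at a given FPR."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= fpr <= x1 and x1 > x0:  # skip vertical segments
            return y0 + (y1 - y0) * (fpr - x0) / (x1 - x0)
    return points[-1][1]


def abroca(labels_a, scores_a, labels_b, scores_b, grid=1000):
    """Approximate the absolute area between two groups' ROC curves."""
    roc_a = roc_points(labels_a, scores_a)
    roc_b = roc_points(labels_b, scores_b)
    total = 0.0
    for k in range(grid):
        fpr = (k + 0.5) / grid  # midpoint of each FPR interval
        total += abs(tpr_at(roc_a, fpr) - tpr_at(roc_b, fpr))
    return total / grid


# Toy example: the model ranks group A perfectly but not group B.
labels_a, scores_a = [1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]
labels_b, scores_b = [1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]
value = abroca(labels_a, scores_a, labels_b, scores_b)
print(value)
```

A value of 0 means the two ROC curves coincide; larger values mean the model discriminates unequally between the groups at some thresholds, even if the two AUC values happen to be close.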
Research
Build database to organize data management
Engineer features
Build model pipeline
Run model pipeline
Evaluate models
Main areas related to Educational Data Mining/Learning Analytics
Most common feature types:
Automated vs Expert Engineered
OULAD (Red & 2013-2014), vs 2015 data (Blue)
Describing first30.all_features.age_band
Proportions:
value frequency proportion
0 0-35 16590 0.694956
1 35-55 7091 0.297043
2 55<= 191 0.008001
Null count: 0
count 23872
unique 3
top 0-35
freq 16590
Name: age_band, dtype: object
Example Map of IMD
OULAD (Red & 2013-2014), vs 2015 data (Blue)
In the current English Indices of Deprivation 2019 (IoD2019) seven domains of deprivation are considered and weighted as follows,
Describing first30.all_features.imd_band
Proportions:
value frequency proportion
0 20-30% 2639 0.110548
1 30-40% 2539 0.106359
2 40-50% 2314 0.096934
3 50-60% 2316 0.097017
4 60-70% 2171 0.090943
5 70-80% 2180 0.091320
6 80-90% 2063 0.086419
7 90-100% 2001 0.083822
8 None 0 0.000000
9 0-10% 2303 0.096473
10 10-20 2406 0.100788
Null count: 940
count 22932
unique 10
top 20-30%
freq 2639
Name: imd_band, dtype: object
Describing first30.all_features.region
Proportions:
value frequency proportion
0 North Western Region 2032 0.085121
1 Scotland 2701 0.113145
2 South Region 2278 0.095426
3 Ireland 938 0.039293
4 Wales 1642 0.068784
5 South East Region 1559 0.065307
6 South West Region 1753 0.073433
7 West Midlands Region 1845 0.077287
8 Yorkshire Region 1440 0.060322
9 East Anglian Region 2434 0.101960
10 East Midlands Region 1680 0.070375
11 London Region 2223 0.093122
12 North Region 1347 0.056426
Null count: 0
count 23872
unique 13
top Scotland
freq 2701
Name: region, dtype: object
Describing first30.all_features.highest_education
Proportions:
value frequency proportion
0 Lower Than A Level 9187 0.384844
1 A Level or Equivalent 10541 0.441563
2 HE Qualification 3660 0.153318
3 No Formal quals 225 0.009425
4 Post Graduate Qualification 259 0.010850
Null count: 0
count 23872
unique 5
top A Level or Equivalent
freq 10541
Name: highest_education, dtype: object
Describing first30.all_features.is_stem
Proportions:
value frequency proportion
0 1 16147 0.676399
1 0 7725 0.323601
Null count: 0
count 23872.000000
mean 0.676399
std 0.467860
min 0.000000
25% 0.000000
50% 1.000000
75% 1.000000
max 1.000000
Name: is_stem, dtype: float64
Describing first30.all_features.final_result
Proportions:
value frequency proportion
0 Fail 5728 0.239946
1 Pass 10857 0.454801
2 Withdrawn 4811 0.201533
3 Distinction 2476 0.103720
Null count: 0
count 23872
unique 4
top Pass
freq 10857
Name: final_result, dtype: object
| | model_type | mean_fit_time | std_fit_time | mean_score_time | std_score_time | mean_test_roc_auc | std_test_roc_auc |
|---|---|---|---|---|---|---|---|
| 0 | rforest | 43.546772 | 6.621612 | 0.308331 | 0.079489 | 0.773876 | 0.006709 |
| 1 | etree | 22.712446 | 2.100217 | 0.221903 | 0.027837 | 0.771116 | 0.004962 |
| 2 | hxg_boost | 8.673935 | 0.686707 | 0.297604 | 0.038754 | 0.770126 | 0.004659 |
| 3 | mlp | 13.990430 | 0.143472 | 0.077275 | 0.011556 | 0.769375 | 0.007105 |
| 4 | hxg_boost | 6.469628 | 1.027840 | 0.263747 | 0.042775 | 0.769049 | 0.006249 |
| 5 | etree | 14.824491 | 3.315720 | 0.251914 | 0.041687 | 0.767965 | 0.004917 |
| 6 | hxg_boost | 1.569689 | 0.177629 | 0.086444 | 0.013947 | 0.767947 | 0.005761 |
| 7 | mlp | 1.304684 | 0.044562 | 0.029652 | 0.002250 | 0.767719 | 0.008325 |
| 8 | rforest | 21.928703 | 4.551270 | 0.763771 | 0.412899 | 0.767640 | 0.005182 |
| 9 | hxg_boost | 1.494005 | 0.194275 | 0.075681 | 0.012553 | 0.767497 | 0.004604 |
Output from test on results from single dataset
Output from test on results from single dataset | Ranked Sign Test
Output from test on multiple datasets:
Initial Database Schemas
Example of GridSearch Cross-Validation for hxg_boost: clf__learning_rate [0.1] clf__random_state [None] clf__learning_rate [0.01] clf__random_state [None] clf__learning_rate [0.001] clf__random_state [None]
Example of RandomizedSearch Cross-Validation for dtree: clf__splitter ['best'] clf__random_state [None] clf__min_samples_split [84] clf__min_samples_leaf [2] clf__max_features [None] clf__max_depth [52] clf__criterion ['log_loss']
Example of RandomizedSearch Cross-Validation for ada_boost: clf__learning_rate [0.016322649720355895] clf__random_state [None]
Example of RandomizedSearch Cross-Validation for hxg_boost: clf__interaction_cst ['no_interactions'] clf__l2_regularization [0.24841032861067147] clf__learning_rate [0.0180795552628978] clf__max_bins [203] clf__max_depth [32] clf__max_iter [30] clf__min_samples_leaf [6] clf__random_state [None] clf__warm_start [False]
Example of RandomizedSearch Cross-Validation for rforest: clf__bootstrap [True] clf__criterion ['log_loss'] clf__max_features ['log2'] clf__max_samples [0.36012871046936756] clf__min_samples_leaf [6] clf__min_samples_split [7] clf__n_estimators [150] clf__n_jobs [-1] clf__oob_score [True] clf__random_state [None]
Example of RandomizedSearch Cross-Validation for etree: clf__bootstrap [True] clf__criterion ['gini'] clf__max_features ['sqrt'] clf__max_samples [0.4409163442397114] clf__min_samples_leaf [5] clf__min_samples_split [8] clf__n_estimators [43] clf__n_jobs [-1] clf__oob_score [True] clf__random_state [None]
Example of RandomizedSearch Cross-Validation for knn: clf__weights ['distance'] clf__p [1] clf__n_neighbors [5] clf__n_jobs [-1] clf__leaf_size [72] clf__algorithm ['ball_tree']
Example of RandomizedSearch Cross-Validation for logreg: clf__C [0.050188292655831426] clf__max_iter [40] clf__n_jobs [-1] clf__penalty [None] clf__random_state [None] clf__solver ['lbfgs']
Example of RandomizedSearch Cross-Validation for mlp: clf__activation ['identity'] clf__alpha [0.024419617927526942] clf__early_stopping [True] clf__hidden_layer_sizes [187] clf__learning_rate ['invscaling'] clf__learning_rate_init [0.008063494205721687] clf__max_iter [46] clf__power_t [0.05170481714548528] clf__random_state [None] clf__solver ['adam']
Example of RandomizedSearch Cross-Validation for svc: clf__C [0.20456796576588798] clf__degree [4] clf__gamma ['auto'] clf__kernel ['poly'] clf__probability [True] clf__random_state [None]
Example of RandomizedSearch Cross-Validation for compnb: clf__alpha [0.010936544356329807] clf__norm [True]
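The listings above show individual settings produced by the two tuning procedures. The difference between them can be sketched with the standard library alone: an exhaustive search enumerates every combination of the listed values, while a randomized search draws a fixed number of settings from distributions. In scikit-learn these roles are played by `GridSearchCV` and `RandomizedSearchCV`. The helper names and the sampling ranges below are hypothetical; only the `clf__learning_rate` grid values are taken from the hxg_boost grid example.

```python
import itertools
import random


def grid_settings(param_grid):
    """Exhaustive search: yield every combination of the listed values."""
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def random_settings(param_samplers, n_iter, seed=0):
    """Randomized search: yield n_iter independent draws, one value per parameter."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        yield {name: sample(rng) for name, sample in param_samplers.items()}


grid = {
    "clf__learning_rate": [0.1, 0.01, 0.001],
    "clf__random_state": [None],
}
samplers = {
    # Hypothetical ranges; the project's actual search spaces are not stated here.
    "clf__learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform draw
    "clf__max_depth": lambda rng: rng.randint(2, 64),
}

exhaustive = list(grid_settings(grid))
randomized = list(random_settings(samplers, n_iter=5))
print(len(exhaustive), len(randomized))  # 3 5
```

The grid grows multiplicatively with each added parameter, whereas the randomized budget stays at `n_iter` however many parameters are sampled, which is why the randomized listings can afford many more parameters per model.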
Scripting Language of Choice
Database System
Database Communication
Database Communication & Data Manipulation
Visualizations
Visualizations
Array Computation
Bayesian Statistical Tests: baycomp, by Janez Demsar, Alessio Benavoli, and Giorgio Corani
Random Variables & Statistical Tests
Machine Learning Models & Components